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Abstract 

We develop an effective non-hierarchical data clustering method using an analogy to the dynamic 
coarse-graining of a stochastic system. Analyzing the eigensystem of an inter-item transition matrix 
identifies fuzzy clusters corresponding to the metastable macroscopic states (macrostates) of a 
diffusive system. A novel "minimum uncertainty criterion" determines the linear transformation 
from eigenvectors to cluster-defining window functions. Eigenspectrum gap and cluster certainty 
conditions identify the proper number of clusters. The physically-motivated fuzzy representation 
and associated uncertainty analysis distinguishes macrostate clustering from spectral partitioning 
methods. Macrostate data clustering solves a variety of test cases that challenge other methods. 
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I. INTRODUCTION 



Finding subgroups or clusters within large sets of items is a problem that occurs in many 
contexts from astronomy to integrated chip design and pattern recognition ([1, 2, 3, 4] for 
reviews). DNA microarray gene expression analysis and bioinformatic sequence comparisons 
are recent and important applications in molecular biology [3, 5]. 

The clustering problem may be posed in two ways: In the first case (e.g., DNA microar- 
rays), Nm measurements are made on each of the N items. The N x Nm measurement 
matrix X is then used in a problem-specific manner to compute a symmetric N x N dis- 
similarity matrix D. Each off-diagonal element provides an inverse indicator of the 
correlations between the measurements of items % and j. A straightforward method is to set 



where g is a problem-specific Euclidean metric tensor. This allows preconditioning of the 
scales of the different measurements and, by using non-diagonal g, adjustment for measure- 
ment correlations (particularly important if some measurements are replicates). Statistical 
noise and complexity can be reduced by using singular value decomposition to pre-identify 
principal components of X that span much of the variation in the measurement space. This 
facilitates identification of clusters "by eye" or with various heuristics (e.g., [6, 7, 8]). 

In the second case (e.g., pairwise gene sequence comparisons), the primary data are 
measures of dissimilarities between pairs of items: In this case D is defined, but there is no 
measurement matrix X and the elements of D may not satisfy the triangle inequality. 

Early clustering methods were "hierarchical," generating open binary trees which can (de- 
pending on the selected cross-section) dissect the items into any number of clusters between 
1 and N . In these methods, the choice of the optimal number of clusters is an independent 
problem [2, 9, 10]. "Agglomerative" hierarchical methods iteratively join items together 
to form a decreasing number of larger clusters; "divisive" hierarchical methods use succes- 
sive subdivision to generate an increasing number of smaller clusters. While agglomerative 
methods can be inexpensive, they usually use only local and not global information, which 
weakens performance [2] . While divisive methods can use global information, they can have 
high complexity in N and thus can be too expensive for large problems. A weakness of 
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both types of hierarchical methods is that they cannot repair defects from previous stages 
of analysis. 

Some clustering methods are based on analogies to physical systems in which macroscopic 
structure emerges from microscopically-defined interactions. A number have used analogies 
to statistical mechanical phase transitions [11, 12, 13, 14, 15], while others have used chaotic 
[16] or quantum mechanical [17] systems as analogs. Most of these have the advantage of 
being "fuzzy" — in addition to assigning items to clusters, they provide a continuous measure 
of the probability or strength of the assignment of each item. 

Clustering can also be performed by objective function optimization. If there is an a 
priori model for the structure of the clusters in the measurement space (e.g., as a collection 
of Gaussians), then a corresponding parametric objective function can be used. Otherwise 
a non-parametric objective function may be useful. For example, graph theory clustering 
methods treat the items as nodes of a graph whose interconnecting edges have "affinities" 
or "weights" determined from D ([18, 19], for review). They typically use "min-cut" or 
"normalized-cut" objective functions and search for the (sometimes "balanced") graph par- 
titioning that minimizes the (sometimes normalized) sum of the weights of the cut edges. 
"Spectral graph theory" [20] methods use the eigenvectors of the affinity matrix (or the 
closely related generalized Laplacian matrix) to assist the process. Spectral bipartitioning 
([21], for history and review), which uses one eigenvector, can be applied recursively for hier- 
archical dissection [22] ; and the development of non-hierarchical methods for the concurrent 
use of multiple eigenvectors is an active topic of research ([19, 23], for reviews). 

We present here a novel, non-hierarchical, fuzzy clustering method which uses an anal- 
ogy between data clustering and the coarse-graining of a stochastic dynamical system. The 
items are regarded as microstates that interact via a dynamical transition matrix T, which 
is derived from D. Clusters are identified as the slowly- relaxing metastable macroscopic 
states (macrostates) of the system. These are computed by concurrently using multiple 
eigenvectors of Y in the same way that macrostates of a continuous diffusive system are 
identified from the eigenfunctions of the Smoluchowski operator [24]. The number of clus- 
ters is algorithmically determined by the spectral properties and cluster overlap criteria. 
We demonstrate that the method can solve difficult problems, including ones in which the 
clusters are irregularly shaped and separated by tortuous boundaries. 
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II. METHOD 



A. Macrostates and stochastic coarse-graining; a brief overview 

Coarse-graining is used in nonequilibrium statistical physics to reduce the dimensional- 
ity and complexity of the dynamical description [25]. In the usual situation, the system 
is initially described microscopically by a fine-grained first-order equation specified over a 
configuration space of microscopic states (microstates) . Microscopic degrees-of-freedom cor- 
responding to very rapid motions are removed by (possibly non-linear) projection. This 
generates a coarse-grained master equation with fewer degrees-of-freedom that describes the 
slower dynamics of the system's macrostates. Each macrostate corresponds to a subregion of 
configuration space that has been projected onto one value of the macroscopic parameters. 
For example, to describe Brownian motion of pollen in water, the (fast) water molecule 
degrees-of-freedom are projected out leaving only the (slow) coordinate of the pollen to 
parameterize the macrostates. In this example the macrostates are continuously parame- 
terized, but they may also be discrete. For example, to describe chemical reactions, each 
macrostate is a chemical state, a subregion of conformation space which includes all vibra- 
tions, translations, and rotations of a specific metastable, covalently-bonded arrangement of 
atoms. 

Coarse-graining projections are not arbitrary: the utility of the resultant macroscopic 
description depends upon the existence of a sufficient disparity between r local , the time- 
scale of the fast (projected-out) motions (which generate ergodicity within the macrostate), 
and r global , the time-scale of the remaining slow motions (which are required for ergodicity 
between macrostates). Appropriate projections can sometimes be chosen heuristically when 
the disparity between r local and r global is large and subjectively obvious. When this is not 
so, projections into discrete macrostates can be selected by analyzing the eigenspectrum of 
the microscopic stochastic dynamics. This procedure is described in detail in Refs. 24 and 
26. We summarize the salient points here. 

Consider the example illustrated in Fig. 1A of a thermal ensemble of systems having 
microscopic parameter x and potential energy V(x). The bimodal equilibrium probability 
density is given by the Gibbs-Boltzmann distribution p cq (x) oc exp[— f3V(x)], where (5 is 
the inverse temperature in inverse energy units. If system dynamics are overdamped (i.e., 
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diffusive), then the nonequilibrium probability distribution p(x; t) evolves in time according 
to the first-order Smoluchowski equation 



where T is the kernel of an operator determined by V , the temperature, and the diffusion 
constant. Once the eigenf unctions and eigenvalues of T have been determined, the general 
solution to Eq. (2) can be expanded as 



where the eigenvalues and right-eigenfunctions of T are — j n and (p n (x), respectively, and 
the expansion coefficients c n are determined by the initial conditions p(x; 0). (Without loss 
of generality we normalize ipo so that Cq = 1, and assume that eigenfunctions are ordered 
according to the magnitudes of their eigenvalues.) 

We always have 7o = and (po(x) = p cq (x), corresponding to the stability of the Gibbs- 
Boltzmann distribution. The other 7„ are non-negative and determine the rates of relaxation 
towards this equilibrium state. While ip is non-negative everywhere, the other eigenfunc- 
tions take both positive and negative values, and the exponential decays of their amplitudes 
generate probability fluxes. For illustration, Fig. f A, panel b displays (for a selected tem- 
perature) v?o and the "slow" right-eigenfunction </?i. If c\ > 0, p(x; 0) will have a probability 
excess (relative to p cq ) for x < and a deficiency for x > 0. These overall deviations from 
equilibrium will decay away as exp(— jit) — > 0. The "fast" eigenfunctions <p n> i will have 
more nodes than ipi and their more rapid decays will transport probability over smaller 
regions. 

Sufficiently large potential energy barriers can separate configuration space into localized, 
dynamically metastable macrostate regions, each having the property that r local , the time 
scale for relaxation of p(x; t) towards p eq (x) within the region, is much less than r global , the 
time-scale for probability to enter or leave the region. r local and r global are determined by the 
eigenvalues, and a disparity between them will correspond to a gap in the eigenspectrum. 
If there are m macrostates, a gap will occur between 7 m _i and 7 m : there will be m slow 
modes generating inter-macrostate probability equilibration, and the remaining fast modes 
will generate intra-macrostate relaxations. 
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For example, in Fig. 1A the energy barrier centered at x = separates configuration space 
into two macrostates a and b (roughly containing the regions x < or x > 0, respectively). 
Correspondingly, there is a spectral gap 7i C 72; so m = 2. 7i is the rate of the slow 
transfer of probability between a and b that is generated by the slow decay of the amplitude 
of (pi. Thus, r global ~ 7-f 1 . The larger values of the 7 n >i correspond to the fast decays of 
the more-rapidly varying y?„>i, corresponding to intra- macrostate probability relaxations. 
The slowest of these rates, 72, determines the time needed for local equilibration. Thus, 

T local ^ 7 -l_ 

In this simple case, it is tempting to "crisply" define the macrostates by inspection 
as the regions x > and x < 0. However, this is inapt for two reasons: (1) A sharp 
boundary at x = introduces high-frequency dynamical modes and thus is incompatible 
with a consistent low-frequency dynamical description; and (2) Subjective inspection and 
barrier identification are not possible in multidimensional problems. Instead, we use this 
example to show how the correct "fuzzy" macrostates can be identified (without subjective 
inspection) by a generalizable algorithm: 

The starting point is the recognition of the spectral gap 71 <C 72. When t > 7J 1 , the 
values of p(x; t) for all x < or all x > will be highly correlated, and relative equilibrium 
within (but not between) these regions will have been achieved. Therefore, in this temporal 
regime p(x; t) can be well-approximated by an expansion within the rank-2 (in general, 
rank-m) macrostate subspace spanned by <po and ipi, and only the first two terms in the 
summation in Eq. (3) need be kept. To obtain a probabilistic description, we replace this 
truncated eigenfunction expansion by an expansion in the alternative basis provided by the 
non- negative macrostate distributions q 9 a {x) and d^x) (to be defined precisely below) shown 
in Fig. 1A, panels c and d. i? (or fib) is approximately proportional to ip for x < (or 
x > 0) and is approximately for x > (or x < 0). Thus, Eq. (3) can be replaced by the 
macrostate expansion 



where Greek letters index macrostates and sums over Greek letters indicate sums over all 
macrostates. (We assume the normalization f $ a (x)dx = 1.) Since $ a and have sig- 
nificant support only for x < and x > 0, respectively, p a (t) and p (t) specify the time- 
dependent amounts of probability in these regions. Their dynamics are described by the 
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coarse-grained macrostate master equation 

dp a (t) 



alt 



H^W^a, (5) 



where Tp a is the macrostate transition matrix. As t — > oo, Eq. (4) reduces to 

lim p(x; t)=ip = Y] p c ^ $ a (x) . (6) 



where is the total probability contained in the macrostate region a at equilibrium. 

The $ a implicitly define the macrostate regions. To make this explicit, we define 
macrostate window functions 

w a {x) = -— . 7 

<Po(x) 

Eq. (6) and the non- negativity of the i? a imply that 

w a (x) >0, Wa,x (8a) 
=1, Vx. (8b) 

a 

w a (x) specifies the probability of assignment of microstate x to macrostate a. The window 
functions corresponding to i? a and ^ are shown in Fig. 1A. They define a fuzzy dissection 
of configuration space into the macrostate regions x < and x > 0. 

Now we can address the precise definition of the d a themselves. Because they span the 
macrostate subspace, they must be linear combinations of the slow eigenfunctions: 

m— 1 

& a (x) = ^M anVn {x) . (9) 

n=0 

Since the ip n are smooth, the i9 a , and hence the w a , must also be smooth. This induces an 
unavoidable uncertainty in microstate assignment. For example, in Fig. 1 the assignments 
are almost certain for \x\ ^> where w a ~ 1, but are highly uncertain for x ~ where 
w a (x) ~ 0.5. The essential point is to choose M, and hence the ^ a and w a , so as to 
maximize certainty: We define T a , the uncertainty of macrostate a, as the sum of its 
equilibrium probability-weighted overlaps with other the other macrostates, relative to its 
total probability: 1 

T = / w a (x)wt3{x)p eci (x) dx ^ 

J w a (x)p CCi (x) dx 



1 This definition is motivated by analysis of the experimental macrostate preparation and measurement 
process [24]. 
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Using Eqs. (6), (7), and (8b), the macrostate certainty T a is 

T Q = 1 - T a = J w 2 a (x)p^(x) dx . (11) 

We choose M so as to maximize the geometric mean of the T a subject to the constraints 
of Eq. (8). This minimum uncertainty criterion minimizes macrostate overlap and, in the 
example of Fig. 1A, results in the i? a and w a shown in panels c and d. The amount of 
overlap of these optimized macrostate functions depends on the magnitude of the spectral 
gap. 



B. Adapting macrostate coarse-graining to data clustering 

To adapt the physical coarse-graining procedure to data clustering, we make the compu- 
tational analogy {microstates, macrostates, T} <-> {items, clusters, f(D~ 1 )}. In this analogy, 
the continuous configuration space of microstates x is replaced by a discrete space of items 
% : 1 < % < N, and the probability distribution p(x,t) is replaced by p(t), the vector of 
individual item probabilities Pi(t) (e.g., see the simple example in Fig. IB). Since pit) is a 
probability vector, it must satisfy 

Pi(t) > , VM, (12a) 
l-p(t) = 1, Vt, (12b) 

where 

h = 1 , Vi . 

By analogy with Eq. (2), we assume that the dynamics in the item-space are described 
by the microscopic master equation 

^ = r- P «, (13) 

where T is a first-order transition matrix. Non-negativity of each Pi(t) under time evolution 
requires that 

r ->o, i^j, (14) 

and conservation of probability requires that 

l-r = 0. (15) 
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The heart of the analogy is to assume that (i ^ j) depends on Dij, the dissimilarity 
between items i and j. If D were embeddable as a distance matrix in a metric space (e.g., as 
when it is computed from a measurement matrix X), then Y could, in principle, be computed 
by solving a multidimensional diffusion equation in the continuous space. However, this 
would be extremely expensive. Instead, we model V from D using the following heuristic 
argument: A starting point is to set Y-ij = (Dij)~ 2 by analogy to the rate of diffusion 
between two isolated microstates in one-dimension. However, this does not account for the 
interception of probability flux by intervening items. To model interception, we use an 
exponential cutoff scaled to the mean nearest-neighbor squared-distance (d^): 

e -(A,) 2 /2<d§> 

r v = — 777^ — , iy^j, (16a) 

N 

( d D = A^E(^<) 2 > ( 16b ) 
i=i 

where D iK is the smallest element in the i th row of D. The diagonal elements of T are fixed 
by Eq. (15). 

T defined by Eq. (16) is symmetric, so its left- and right-eigenvectors are identical. There- 
fore, Eq. (15) implies that 

T-1 = 0, (17) 
and the equilibrium probability vector p cq is 

p cq = 1 . (18) 

Eq. (14) and the symmetry of T imply that all its eigenvectors ip n are orthogonal and that 
all its eigenvalues — 7„ are non-positive (see Appendix B). It is convenient to use bra-ket 
notation to indicate the renormalized inner product, 

(x\y) = N- 1 x ■ y , (19) 

and to normalize the eigenvectors so that 

(^n|Vm> = $nm ■ (20) 

Then, 

<p = l. (21) 
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Fig. IB illustrates cp and <pi computed in this way for a simple case of N = 20 items in a 
1-dimensional measurement space. 

Because all the elements of cp are identical, the vector analog of Eq. (7) is trivial and 
the macrostate distributions and window functions are directly proportional to each other. 
Therefore, we simplify by expressing the m cluster window vectors directly in terms of the 
m slow eigenvectors (for now we assume that m has been specified): 

m— 1 

W a = M *n Vn ■ (22) 

n=0 

Analogously to Eqs. (8), the w a satisfy the positivity and summation constraints required 
for a probabilistic interpretation: 

Wi > 0, Va,i , (23a) 
£>« = I- (23b) 

a 

Eqs. (21) and (23b) and the orthonormality of the eigenvectors implies the m summation 
constraints on M 

Y J M an = 5 nQ . (24) 

a 

By analogy to Eq. (11), the certainty of cluster a is 

Tq(m) = <^K> (25) 

As in the continuous case, < T Q < 1. Maximizing the geometric mean of the T a is 
equivalent to minimizing the objective function 

*(M) = -J>gT a (M). (26) 

a 

Minimization of $ consistent with the linear constraints of Eq. (24) and the linear in- 
equality constraints of Eq. (23a) fixes M, and hence the w a , for a specified value of m. The 
solution of this global optimization problem is discussed in Appendix A. There we show 
that the resultant w a are linearly independent, so they provide a complete basis for the 
macrostate subspace. Once the w a have been computed we complete the clustering proce- 
dure for m by assigning each item i to the cluster a having the maximal value of {w a )i- We 
say that the assignment is "strong" or "weak" depending on how close this maximal value, 
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the item assignment strength, is to 1. The assignments are extremely strong for the example 
shown in Fig. IB (re panels c and d) because of the relatively large separation between the 
two clusters. 

In some cases, the procedure may define a cluster with only a single item. In this case 
r iocai j g undefined an d there is no meaningful dissection of dynamics into slow and fast 
processes. Therefore we treat such outliers by a special procedure: When one is identified, 
it is removed from the dataset and the pruned dataset is reanalyzed. The pruning procedure 
is repeated if new outliers appear. We designate the final clustering as C(m). 

C. Determining the number of clusters 

We use two conditions to determine if C(m) is an acceptable clustering: As motivated 
above, we examine the eigenspectrum of V for spectral gaps, which are defined as 



where p 7 is the minimum gap parameter. However, Eq. (27) alone may accept excessively 
fuzzy clusters having weak item assignment vectors. To eliminate these, we supplement Eq. 
(27) with the minimum macrostate certainty conditions: 



We have found that choosing p 7 = 3 and px — 0.68 (the fraction of the normal distribution 
contained within one standard-deviation of the mean) works well for all the problems that 
we have tested (see Results). 

The complete algorithm is to sequentially compute C{m) for m — 2, 3, . . . and to test 
these clusterings for acceptability according to Eq. (28). If multiple clusterings are accept- 
able, we will usually be most interested in the one of largest to, since it provides the finest 
resolution. As a practical matter, if C(m) is not acceptable for three consecutive to's we 
assume that it will not be acceptable for higher to's and terminate the analysis. 

D. Computational implementation 

Only two steps in the procedure are potentially expensive: computing the slow eigen- 
vectors and eigenvalues of T and the global minimization of $(M). Since we only use a 



1m/lm—l ^ P7 ; 



(27) 



(28) 
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relatively small number (typically m < 10) of slow eigenvectors, it suffices to compute only 
these via the Lanczos method [27] at cost ~ 0(N 2 ). The global minimization of $(M) is a 
linearly constrained global optimization problem in m(m — 1) dimensions. The number of 
vertices of the feasible region-bounding polytope increases with N, formally as a polynomial 
dependent on m. However, at least for the problems tested here, a simple solver is adequate 
(see Appendix A). 

III. RESULTS 

We tested the method on a number of problems that have challenged other clustering 
methods. Bivariate problems in which the data set can be graphically displayed in two di- 
mensions were used to enable subjective evaluation of the quality of the results. In addition, 
to check that performance did not depend on low dimensionality of the data space, we tested 
problems where the items were embedded in a 20-dimensional space. 

A. Bivariate test-cases 

The algorithm was evaluated on four previously described difficult test-cases. In each case, 
the dataset consisted of Nm = 2 measurements on each of N items. These can be represented 
as N points in a 2-dimensional space. For example, the "crescentric" clustering problem 
shown in Fig. 2a consists of 52 items, each represented as a point in the 2-dimensional 
measurement space. The two clusters are closely juxtaposed crescents, which makes the 
problem difficult [2, 28]. The D matrix was computed from the coordinates using Eq. 
(1) with g ab = 5 ab , and T was computed from D according to Eqs. (16). The slowest 
eigenvectors, ifo, <pi, and cp 2 , are graphically displayed in panels b, c and d, respectively. As 
per Eq. (18), all components of ifo are identical. It is gratifying to see that (fi clearly reflects 
the two-cluster structure: the components of all the items in the bottom-right crescent are 
positive, while the components of all the items in the other crescent are negative. The next 
eigenvector, <p 2 , has three localized regions of same-sign components. Subjectively, it is 
clear that separating into these regions would overdissect the space. As predicted by the 
discussion above, these eigenvector properties in the spatial domain correspond in the time 
domain to an eigenspectrum gap between 71 and 72 (Fig. 2 and Table I). In contrast, there 
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is no gap between 72 and 73 (Fig. 2). This suggests that the m — 2, but not the m = 3 
clustering will be acceptable. 

The task for the algorithm is to recognize that the correct clustering is embedded in the 
structure of <pi, and to define the proper clustering. Applying it for m — 2, 3, . . . yields 
the clusters shown in the top panels of Fig. 3. (For illustration, we display clusterings that 
do not satisfy the spectral gap condition, even though these would not be computed by an 
efficient algorithm.) The cluster certainties T Q are listed in Table I. The m = 2 clustering 
satisfies both Eqs. (27) and (28), and all clusterings with m > 2 fail both criteria. Therefore, 
the algorithm correctly selects m = 2 clusters. The individual item assignment strengths for 
this clustering are displayed in Fig. 4; most are in the range of 0.7 — 0.9, indicating that there 
is significant fuzziness resulting from the close juxtaposition of the clusters. Nonetheless, all 
the item assignments are made correctly. 

Three other test problems were analyzed in the same way: (1) The "intersecting" problem 
consists of two barely-contacting sets of items having highly anisotropic Gaussian distribu- 
tions. It has previously been used to demonstrate the weakness of non-parametric opti- 
mization clustering for clusters of greatly different shapes and sizes [2]. (2) The "parallel" 
problem consists of two highly extended, anisotropic sets of items whose separation along 
the vertical axis is much smaller than their horizontal extent. It has previously been used 
to demonstrate the failure of agglomerative hierarchical methods [2]. (3) The "horseshoe" 
problem [3] consists of a central cluster of items surrounded by a horseshoe-shaped cluster 
of items. The center-of-mass of the outer cluster lies within the inner cluster, increasing dif- 
ficulty. In addition, a "random" test set, in which points were randomly distributed within 
a square two-dimensional region, was analyzed as a control. 

The results obtained for m — 2, 3, 4, and 5 are illustrated in Fig. 3. The acceptable 
clusterings that satisfy Eqs. (27) and (28) are outlined by dark boxes. Only a single clustering 
is acceptable in each case (although this need not be so in general). None of the random 
control clusterings are acceptable, correctly indicating that it should not be clustered. 

As with the crescentric problem, the clustering solution for the "horseshoe" test-case 
(fourth row, Fig. 3) is straightforward, with m — 2. Cluster certainties (Table II) and 
item assignment strengths (Fig. 4) are extremely strong (> 0.99). The "parallel" problem 
is slightly more challenging in that two of the items (located at the extreme left and right 
sides of the item distributions) are identified as outliers. Nonetheless, the algorithm correctly 
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identifies the m = 2 clustering of the non-outlying items. As expected, the item assignment 
strengths are lower for the items in the central overlapping region, and higher for the non- 
overlapping items near the left and right edges (Fig. 4). 

The solution to the "intersecting" problem is more elaborate: While the m = 2 solution 
is subjectively acceptable, the assignment strengths of some of the items in the vicinity 
of the intersection have weak item assignment strengths. Because of this, the m = 2 and 
m = 3 clusterings do not satisfy the required assignment certainty condition Eq. (28) and 
are rejected by the algorithm. The acceptable m = 4 clustering resolves this difficulty 
by segregating these uncertain items into a separate small cluster. It also segregates two 
outliers (in the top-right corner) while assigning most of the items to two major clusters, as 
desired. The individual item assignment strengths are strong, except for one item near the 
intersection of the three clusters (Fig. 4). 

None of the C(m) are acceptable for the "random" distribution of items because all of the 
lm/lm-i were < 2.5 for m > 1. Thus, the algorithm is not fooled into spurious clustering. 

B. Gaussians with varying overlap in two and twenty dimensions 

We systematically tested the performance of the algorithm as a function of the relative 
distance between clusters. For this purpose, four pseudo-random groups of 50 items were 
generated with Gaussian kernels having variance Xj. The centers-of-mass of the groups were 
themselves pseudo-randomly selected from a Gaussian kernel having variance X 2 g (see Fig. 
5). The corresponding ratio of the expected root-mean-square (rms) intercluster item-item 
separations to the rms miracluster item separations is 



Tests in a bivariate measurement space were conducted for X g f Xz varying from 16 (where the 
clusters were highly separated) down to 2 (where the clusters were completely overlapping). 
The algorithm dissects the items into four clusters when X g /Xe = 16. When X g /X^ = 8 
and Xg/Xi = 4, the top two groups partially merge (see Fig. 5), and the algorithm accord- 
ingly reports m = 3 clusters. The clusters are not subjectively separable for X g /Xe = 2; 
correspondingly, the algorithm reports m = 1 cluster. The assignment strengths for these 
clusterings are displayed in Fig. 5. 




(29) 
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The same test was performed with four groups generated as described above using Gaus- 
sian kernels in a 20-dimensional space. The increased dimensionality does not alter Eq. (29). 
However, the distributions of the inter- and intra-group squared-distances are narrower: the 
standard deviations of the inter- and intra-group (AR) 2 normalized by their means are both 
smaller by factors of a/20/2 = \/l0. Therefore, for a given value of X g /Xe, clustering is 
actually easier in higher dimensionality. To compensate and make the 20-dimensional test 
more challenging, the range of \ g /\e was reduced by a factor of 4 (roughly matching \/l0); 
i.e., X g /X f was varied from 4 down to 0.5. The algorithm correctly identifies the four clusters 
for Xg/Xe = A and X g /Xe = 2. The individual item assignment strengths of these clusterings 
are displayed in Fig. 6. These are all close to one for X g /Xt = 4 and X g /Xe = 2, indicating 
unambiguous clustering. At smaller values of X g / Xe, the only clustering satisfying both the 
minimum gap and minimum certainty conditions has one cluster containing all the items. 
Even so, for X g /Xi = 1, the (formally unacceptable) m = 3 clustering correctly reflects some 
of the group structure (Fig. 6). 

IV. DISCUSSION 

We have shown that macrostate clustering performs well on a variety of test problems 
that have challenged other methods. The method only needs a dissimilarity matrix D 
(not a data matrix X) and has the advantage of being non-hierarchical, 2 which should 
improve performance in general. Beyond identifying potential clusterings, it uses internal 
criteria — the eigenspectrum gaps ^m/lm-i and the cluster certainties T a — to determine the 
appropriate number of clusters. The corresponding acceptance parameters, p 1 and pr, were 
empirically determined and gave robust performance — a single choice worked well for all the 
problems tested. 

Eigenvectors have previously been used for clustering by many different spectral graph 
theory (SGT) partitioning methods: SGT bipartitioning methods use the values of <p\ to 
define a one-dimensional ordering of the items which can then be divided by a heuristic. A 

2 For example, the m = 5 "crescentric" clustering can not be obtained by subdividing its m — 4 clustering 
and the m = 4 "horseshoe" clustering is not hierarchically related to its m — 3 clustering. Nevertheless, 
inherent hierarchical structure can still emerge, and some was evident in all the problems. For example, 
all the clusterings for 2 < m < 5 for the "intersecting" and "parallel" problems are hierarchically related 
(Fig. 3). 
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variety of different approaches have been developed to extend this to multiple eigenvectors 
and clusters ([18, 19, 21, 23], for review). For example, recursive spectral bipartitioning 
generates a hierarchical binary tree [22]; some methods use k eigenvectors to define 2 k clusters 
[29]; and many methods project the items into the subspace spanned by k eigenvectors and 
then use a partitioning heuristic to identify clusters within the subspace (e.g., [8, 19, 23, 30, 
31, 32] and references therein). 

Macrostate clustering differs in that it computes continuous (fuzzy) assignment window 
vectors rather than partitionings. 3 This has important ramifications: It permits the window 
vectors to be expressed as linear combinations of the eigenvectors [see Eq. (22)]; this neces- 
sarily results in window function overlap and cluster uncertainty. Combining these concepts 
with the principle of uncertainty minimization provides a simple prescription for the con- 
current use of multiple eigenvectors in clustering. A related difference is that the number of 
clusters is internally determined in macrostate clustering, while it is usually fixed a priori 
or determined by eigensystem- independent heuristics in SGT methods (e.g., [18, 19, 23] and 
references therein). It is perhaps surprising that the spectral gap condition has not been 
used for this purpose in SGT approaches. 4 This may reflect the fact that it does not work 
well by itself, and the companion minimum cluster certainty condition is not available when 
(crisply) partitioning. Macrostate and SGT clustering also differ in the manner in which V 
(or the SGT analog) is computed from the dissimilarity matrix D. SGT methods typically 
use a weight matrix equivalent to = exp(— Dij/a), i ^ j, where a is an empirically- 
determined scale constant. In contrast, motivated by the analogy to a diffusive system, 
we used Eqs. (16). While this difference not of fundamental significance, the relationship 
between V and D can affect performance. Thus, it may be helpful to test the use of Eqs. 
(16) in SGT methods or the SGT relationship in macrostate clustering. 

The use of a linear transformation from indefinite, orthogonal eigenvectors to semidefi- 
nite, non-orthogonal window vectors is fundamental, but some freedom remains in the choice 
of the objective function used to determine the optimal transformation and in the condi- 
tions used to determine acceptable clusterings. An uncertainty minimization criterion is a 

3 Drineas et al. [6] consider real-valued "generalized clusters" within a SGT context, but these are indefinite 
and do not have a probabilistic interpretation. 

4 However, spectral gaps have been used heuristically to determine the appropriate dimensionality of singular 
subspaces in data mining [33]. 
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natural choice, since it is (in an information-theoretic sense) the entropic counterpart to the 
(implicit) "energy" minimization criterion that focuses attention on the slow eigenvectors 
(see Sec. II. C of [26]). On-the-other-hand, the definition of uncertainty could be modified 
and tested for improved performance. Similarly, while we believe that it is advantageous to 
combine energetic (spectral gap) and entropic (cluster certainty) conditions in determining 
the number of clusters, it may be possible to improve upon the specific criteria used here. 

Other improvements and extensions merit attention: (1) While we accepted or rejected 
each clustering in toto, it may be useful in some cases to examine incomplete clusterings in 
which only some of the clusters satisfy the cluster certainty condition. This modification 
would enable the algorithm to resolve all four clusters for the case of A 9 /A£ = 8 in Fig. 5. 5 (2) 
The individual item assignment strengths (w a )i measure the certainty of each assignment, 
but their precise statistical significance is not known. It would be helpful to have a model 
for assessing this. (3) The cluster transition matrix 7^ = (w/3\r\w a ) can be used to assess 
the strength of the relationship between the clusters and may be useful in setting the cluster 
acceptance criteria. (4) We have defined T as a symmetric matrix, which implies that 
p cq oc 1. However, this restriction is not required: The generalization to asymmetric T is 
straightforward [24] and it could be used to incorporate additional experimental information. 
For example, if item i is known a priori to be partially redundant with other items (e.g., 
when analyzing expression levels of members of gene families), it may be given reduced 
weight in the analysis by setting pf 1 < I. 

Our main goal has been a proof-of-principle demonstration of the high quality of the 
clusterings provided by the dynamical macrostate approach. The current implementation is 
sufficiently efficient for problems where N ~ O(10 2 ), but we have not examined performance 
for very large problems. The continuous formulation replaces the NP-hard combinatoric SGT 
partitioning problem with a global minimization problem having polynomial complexity in 
N. However, the order of the polynomial can be very large for large m (Appendix A) so, 
formally, this is not much of an improvement. Nonetheless, as discussed in Appendix A, 
because the objective function is smooth and the constraints are highly degenerate, a simple 
solver has worked well and we believe that it will be possibly to obtain adequate approximate 

5 The m = 5 solution identifies the four major clusters with strong certainty, but also groups three items 
(located near the boundary between the two top clusters) into a fifth cluster which has T a < p r . In an 
incomplete clustering, all but these three items would be unambiguously assigned. 
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solutions efficiently even for very large problems. This remains to be examined. 
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APPENDIX A: MINIMIZING $(M) 

$(M) is to be minimized as a function of the m 2 elements of M an within the feasible re- 
gion specified by the mx N linear inequality constraints of Eq. (23a). The rows of M can be 
regarded as the coordinates of m particles in the m-dimensional space of the slow eigenvec- 
tors. Designating the coordinate row vector of particle a as M a = (M a Q, M a i, . . . M a ( m _i)), 
M is the outer product of the M a 's: 

M = (g)M a . (Al) 

a 

The equality constraints of Eq. (24) imply that the center-of-mass of the m particles is at 
position 

-£^ = -' (A2) 

a 

where e is the unit vector in the th direction: 

e = (l, 0, ...,0). (A3) 

[Eq. (A3) must be modified when there is more that one stationary eigenvector; see Appendix 
B.] The feasible region is a polytope in the m(m — 1) dimensional subspace where Eq. (A2) 
is satisfied. 

The minimum of $(M) must lie at a vertex of this polytope. Proof: The gradient of $ 
with respect to M a is 

V a $ = = -2=^- + zzr- 5 — , (A4) 
5M a \M a \ 2 M a oe 



and the Hessian is 

^ 2 3> 21 M a ®M a e ®s 1 

4 — =f h — , (A5) 



V a ® EE — — = -6 a p 



|M Q | 2 |M a | 4 (M Q oe ) 5 
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where / is the m x m identity matrix and o denotes the inner product over the eigenvector 
indices, 

m— 1 

y = ^2 Xn Vn 



X O 



In - 

?1=0 



The gradient does not vanish anywhere, so $ has no minimum in the absence of constraints. 

In fact, a minimum will occur only when all m 2 degrees-of-freedom are constrained by 
the m equality constraints and m(m — 1) inequality constraints. To see this, consider the 
situation without the equality constraints, but with some number c < m(m — 1) of active 
inequality constraints. Each active inequality constraint acts on a single w a , so by Eq. (22) 
it acts on a single M a to enforce 

M a o = , (A6) 

where Tp is the supervector having components (cpo, cp±, . . . , (p m -i). Therefore, the inequal- 
ity constraints are separable and, similarly to Eq. (Al), the space of inequality-constrained 
M's can be expressed as the outer-product of the subspaces of inequality-constrained M a 's. 
Thus, if M c = a is an inequality-constrained minimizer of $, it must be stable with 
respect to independent variations of each of the inequality-constrained M^. However, this 
is not possible: For any such variation M c a — > M c a + 8 a , the existence of a minimum would 
require that 

S a o V«$ = (A7) 

and 

4o(V Q ® V )$o5 > . (A8) 

However, Eqs. (A4) and (A7) imply that 

M a o S a _ S a o i 
\M a \ 2 2M a oe ' 

and combining this with Eq. (A5) implies that 

— > — > 2\S I 2 

4 o (v a ® v a )$ o 4 = -4^- < o . 

Thus, Eqs. (A7) and (A8) cannot both be true. Therefore, a minimum can occur only if all 
variations of the M a are prevented by a combination of inequality- and equality-constraints. 
Since there are only m equality constraints, we must have c = m{m — 1) active inequality 
constraints. This corresponds to a vertex of the feasible region. 
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Note also that the minimizing {M c a } must be linearly independent within the Tri- 
dimensional slow eigenvector space. This implies that the associated {w a } must span the 
macrostate subspace. Proof: If the {M^} are not independent, there would exist a linear 
combination of vectors such that 

a 

Then, the combined variation 

M c a ^M c a + 5^M c a , Va, 

where 5 is a small number, will not affect the equality constraint Eq. (A2). As proven 
above, all the components of the constrained minimum must be fixed by constraints, so this 
variation must be excluded by an inequality constraint. However, this variation only rescales 
each M c a and hence each w a . Therefore it also will not affect the inequality constraints and 
is permitted, contrary to assumption. Reductio ad absurdum. 

To find the vertex with the lowest value of <3>, we used a simple minimizer that operates 
in the m(m — l)-dimensional subspace that remains after one of the M a has been explicitly 
eliminated using Eq. (A2). The minimizer starts from M a = io/m, Va, chooses a random 
direction in the m{m — l)-dimensional space, proceeds to the nearest inequality constraint, 
and then proceeds along faces of the feasible region (of decreasing dimensionality) until a 
vertex is reached. This process was repeated until the same extremal minima was found 
three times or for a minimum of 500,000 trials, whichever was greater. 

Accounting for the separability of the inequality constraints and assuming no degeneracies 
between the values of the <p n (the usual case), the number of vertices of the constraining 
polytope might grow as rapidly as 0{N m ). However, we expect that most of the inequality 
constraints of Eq. (23a) will be almost degenerate because of the relatively small differences 
between the components of the eigenvectors at different items within a cluster. Moreover, 
the objective function $ is smooth, so we expect that the variation of $ over nearby vertices 
will be small. Therefore, it will not much affect the w a if a neighbor, rather than the global 
minimizer itself, is found. Thus, we anticipate that the practical growth in computational 
cost with N will be much less than the worst-case bound. These considerations also suggest 
that it will always be advantageous to use solvers that move through the [m{m — 1) — 1]- 
dimensional space of search-space directions rather than between vertices of the constraining 
polytope. 
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APPENDIX B: DEGENERATE "ZERO" EIGENVALUES 



Because T is a symmetric matrix which satisfies Eqs. (15) and (17), 



—x • r • x — 



Xi Xj ) , 



(Bl) 



for any vector x. The right-hand-side (rhs) can be viewed as the potential energy of N par- 
ticles having pairwise quadratic interactions in one-dimension. Because all the off-diagonal 
elements of T are positive, the rhs must be non-negative. The implied non-positivity of 
x ■ T ■ x for all x implies that all the eigenvalues of V must be non-positive. Furthermore, 
the isomorphism makes it evident that x — 1 is the only stationary eigenvector (up to a 
multiplicative constant) unless the dataset contains an isolated subset S, which has = 
if i G S and j £ S. In this case, T will have multiple zero eigenvalues, and there will be 
one stationary eigenvector corresponding to each isolated subset. This degeneracy can be 
removed by analyzing each isolated subset independently. 

It is more common to encounter approximate isolation in which none of the is exactly 
zero but in which there are multiple small eigenvalues that are on the scale of numerical 
accuracy. (This occurs in the Gaussian clustering problem shown in Fig. 5 when \ g /\e is 
large.) This can cause numerical problems: the (fo returned by a numerical eigensystem 
solver will not necessarily satisfy Eq. (21), but instead will be a linear combination of the 
approximately degenerate eigenvectors. Because of this, Eq. (21), and hence Eq. (24), may 
not be true. 

The simplest resolution of this numerical problem is to replace Eq. (24) with Eq. (A2) 
and to replace Eq. (A3) with 



This does not require the numerical validity of Eq. (21). 
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FIGURES 



FIG. 1: Heuristic examples. (A) Identifying the macrostates of a continuous stochastic system in 
one-dimension, a: The potential V(x) and eigenvalue spectrum, b: The zeroth and first excited 
right-eigenfunctions of the corresponding diffusive dynamical (Smoluchowski) equation, c and d: 
The two macrostate distribution and window functions. (B) Macrostate clustering of items in a 
one-dimensional space, a: The positions of the items in the univariate measurement space, b: 
Graphical representation of the zeroth and first eigenvectors of T; the height of the bar at the 
position of item i corresponds to its component within the indicated eigenfunction. c and d: The 
components of the two window vectors corresponding to the left (w a ) and right clusters. 

FIG. 2: "Crescentric" clustering problem and its slow eigenvectors, a: The x and y coordinates 
of each point correspond to two measurement values of the corresponding item, b, c and d: cpo, 
ipi, and (fi2, respectively. For illustration, the amplitude of the i th component of each (p n is 
represented by the height (if positive) or depth (if negative) of a cone centered at position i. The 
relative magnitudes of the corresponding eigenvalues are indicated. 

FIG. 3: Bivariate test-cases. The algorithmically-determined clusterings C(m) for 2 < m < 
5 are displayed for four bivariate examples in which the items are points in a two-dimensional 
measurement space. Clusters are distinguished by different symbols, except that unfilled squares 
identify items that were designated as outliers by the algorithm. The acceptable clusterings, which 
satisfy Eqs. (27) and (28), are outlined by dark boxes. 

FIG. 4: Item assignment strengths for the acceptable clusterings. The acceptable clusterings for 
each of the problems in Fig. 3 are shown. The height of the dark section of the bar relative to its 
total height at the position of an item indicates its assignment strength. 
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FIG. 5: Clustering of Gaussian-distributed items in two dimensions for various cluster separations. 
Top: The unique acceptable clustering for each value of \ g /\t is indicated. Bottom: The height 
of the dark section of the bar at the position of an item indicates its assignment strength. (Most 
of the strengths are « 1). 

FIG. 6: Item assignment strengths for cluster solutions for various group separations in 20 dimen- 
sions. Items were pseudo-randomly distributed into four groups in a 20-dimensional measurement 
space for different values of \ g /\e as described in the text. The items within each group have 
consecutive serial numbers (i.e., items 1-50 are in the first group, 51-100 are in the second group, 
etc.). Their assignment strengths for the indicated C{m) clusterings are displayed in the each case. 
(Item 171 is an outlier for both clusterings shown in the bottom row; hence it is not assigned to 
any cluster.) However, only the m = 4 clusterings for \ g /\t = 4 and \ g /\e = 2 are acceptable; the 
C(3) and C(2) shown in the bottom panels fail the acceptability conditions of Eqs. (27) and (28) 
because of their low cluster certainties. 
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TABLES 



TABLE I: Crescentric cluster analysis 



m T a (m) 

2 3.52 0.71 

0.70 

3 1.12 0.67 

0.41 
0.53 

4 2.73 0.83 

0.81 
0.51 
0.53 

5 1.03 0.71 

0.47 
0.55 
0.38 
0.38 
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TABLE II: Bivariate test case analyses 



Problem m 7m T n (m) 
Crescentric 2 3.52 0.71 

0.70 

Intersecting 4 3.82 0.91 

0.95 
0.84 
0.94 

Parallel 2 10.68 0.93 

0.93 

Horseshoe 2 60.73 0.998 



0.99 



Random 
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